
Use GPUArrays accumulation implementation #2813


Open · wants to merge 3 commits into master

Conversation

@christiangnrd (Member)

Opened to run benchmarks.

Todo:

  • Add a compat bound when the GPUArrays version is released

github-actions bot (Contributor) commented Jul 20, 2025

Your PR requires formatting changes to meet the project's style guidelines.
Please consider running Runic (git runic master) to apply these changes.

Suggested changes:
diff --git a/perf/array.jl b/perf/array.jl
index 3dbab9816..400de2231 100644
--- a/perf/array.jl
+++ b/perf/array.jl
@@ -54,11 +54,11 @@ let group = addgroup!(group, "reverse")
     group["1d"] = @async_benchmarkable reverse($gpu_vec)
     group["1dL"] = @async_benchmarkable reverse($gpu_vec_long)
     group["2d"] = @async_benchmarkable reverse($gpu_mat; dims=1)
-    group["2dL"] = @async_benchmarkable reverse($gpu_mat_long; dims=1)
+    group["2dL"] = @async_benchmarkable reverse($gpu_mat_long; dims = 1)
     group["1d_inplace"] = @async_benchmarkable reverse!($gpu_vec)
     group["1dL_inplace"] = @async_benchmarkable reverse!($gpu_vec_long)
     group["2d_inplace"] = @async_benchmarkable reverse!($gpu_mat; dims=1)
-    group["2dL_inplace"] = @async_benchmarkable reverse!($gpu_mat_long; dims=2)
+    group["2dL_inplace"] = @async_benchmarkable reverse!($gpu_mat_long; dims = 2)
 end
 
 group["broadcast"] = @async_benchmarkable $gpu_mat .= 0f0
diff --git a/test/runtests.jl b/test/runtests.jl
index b6c479cce..89bf840c9 100644
--- a/test/runtests.jl
+++ b/test/runtests.jl
@@ -5,7 +5,7 @@ using Printf: @sprintf
 using Base.Filesystem: path_separator
 
 using Pkg
-Pkg.add(url="https://github.com/christiangnrd/GPUArrays.jl", rev="accumulatetests")
+Pkg.add(url = "https://github.com/christiangnrd/GPUArrays.jl", rev = "accumulatetests")
 
 # parse some command-line arguments
 function extract_flag!(args, flag, default=nothing; typ=typeof(default))

github-actions bot (Contributor) left a comment

CUDA.jl Benchmarks

Benchmark suite Current: 3c02fa9 Previous: 205c238 Ratio
latency/precompile 43098463154.5 ns 42934926801 ns 1.00
latency/ttfp 7012905021 ns 7008552789 ns 1.00
latency/import 3574306668 ns 3569139582 ns 1.00
integration/volumerhs 9610435 ns 9606581 ns 1.00
integration/byval/slices=1 147160 ns 147311 ns 1.00
integration/byval/slices=3 426070 ns 426127 ns 1.00
integration/byval/reference 145095 ns 145282 ns 1.00
integration/byval/slices=2 286522 ns 286537 ns 1.00
integration/cudadevrt 103592 ns 103674 ns 1.00
kernel/indexing 14293 ns 14638.5 ns 0.98
kernel/indexing_checked 14958 ns 15045 ns 0.99
kernel/occupancy 720.3851351351351 ns 669.9465408805031 ns 1.08
kernel/launch 2162.222222222222 ns 2202.4444444444443 ns 0.98
kernel/rand 18437 ns 17466 ns 1.06
array/reverse/1d 20190 ns 20143 ns 1.00
array/reverse/2d 23777 ns 24692 ns 0.96
array/reverse/1d_inplace 10893 ns 11332 ns 0.96
array/reverse/2d_inplace 13309 ns 13662 ns 0.97
array/copy 21111 ns 21281 ns 0.99
array/iteration/findall/int 118061 ns 159966.5 ns 0.74
array/iteration/findall/bool 98917 ns 141602 ns 0.70
array/iteration/findfirst/int 158577.5 ns 163419 ns 0.97
array/iteration/findfirst/bool 159266.5 ns 165377 ns 0.96
array/iteration/scalar 73974 ns 76152 ns 0.97
array/iteration/logical 175055 ns 219912.5 ns 0.80
array/iteration/findmin/1d 47340 ns 47580 ns 0.99
array/iteration/findmin/2d 96420 ns 97060 ns 0.99
array/reductions/reduce/Int64/1d 46877 ns 43742.5 ns 1.07
array/reductions/reduce/Int64/dims=1 53196 ns 47519.5 ns 1.12
array/reductions/reduce/Int64/dims=2 62497.5 ns 62503 ns 1.00
array/reductions/reduce/Int64/dims=1L 89099 ns 89134 ns 1.00
array/reductions/reduce/Int64/dims=2L 90082.5 ns 87634.5 ns 1.03
array/reductions/reduce/Float32/1d 34719 ns 35637 ns 0.97
array/reductions/reduce/Float32/dims=1 51741 ns 51967.5 ns 1.00
array/reductions/reduce/Float32/dims=2 59582 ns 59824 ns 1.00
array/reductions/reduce/Float32/dims=1L 52550 ns 52680 ns 1.00
array/reductions/reduce/Float32/dims=2L 70184 ns 70568 ns 0.99
array/reductions/mapreduce/Int64/1d 46053 ns 43514 ns 1.06
array/reductions/mapreduce/Int64/dims=1 53641.5 ns 46605.5 ns 1.15
array/reductions/mapreduce/Int64/dims=2 63304.5 ns 62143.5 ns 1.02
array/reductions/mapreduce/Int64/dims=1L 89158 ns 89174 ns 1.00
array/reductions/mapreduce/Int64/dims=2L 87207.5 ns 87305.5 ns 1.00
array/reductions/mapreduce/Float32/1d 34806 ns 35464 ns 0.98
array/reductions/mapreduce/Float32/dims=1 48546 ns 42505.5 ns 1.14
array/reductions/mapreduce/Float32/dims=2 59774 ns 60252 ns 0.99
array/reductions/mapreduce/Float32/dims=1L 52862 ns 52803 ns 1.00
array/reductions/mapreduce/Float32/dims=2L 70614 ns 70795 ns 1.00
array/broadcast 20552 ns 20737 ns 0.99
array/copyto!/gpu_to_gpu 11319 ns 13192 ns 0.86
array/copyto!/cpu_to_gpu 215254.5 ns 217123 ns 0.99
array/copyto!/gpu_to_cpu 283817 ns 287100 ns 0.99
array/accumulate/Int64/1d 80265 ns 126109 ns 0.64
array/accumulate/Int64/dims=1 220793 ns 84201 ns 2.62
array/accumulate/Int64/dims=2 112332 ns 158968 ns 0.71
array/accumulate/Int64/dims=1L 410035 ns 1710638 ns 0.24
array/accumulate/Int64/dims=2L 5155424 ns 967410.5 ns 5.33
array/accumulate/Float32/1d 55731 ns 109994 ns 0.51
array/accumulate/Float32/dims=1 201773 ns 81343 ns 2.48
array/accumulate/Float32/dims=2 92523 ns 148659 ns 0.62
array/accumulate/Float32/dims=1L 245125 ns 1619411 ns 0.15
array/accumulate/Float32/dims=2L 3735231 ns 699433 ns 5.34
array/construct 1260.9 ns 1288.5 ns 0.98
array/random/randn/Float32 47976 ns 45344 ns 1.06
array/random/randn!/Float32 24949 ns 25330 ns 0.98
array/random/rand!/Int64 27300 ns 27554 ns 0.99
array/random/rand!/Float32 8829 ns 8908.333333333334 ns 0.99
array/random/rand/Int64 30165 ns 30218 ns 1.00
array/random/rand/Float32 13153 ns 13361 ns 0.98
array/permutedims/4d 60598.5 ns 60397 ns 1.00
array/permutedims/2d 54811 ns 54394 ns 1.01
array/permutedims/3d 55558 ns 55362 ns 1.00
array/sorting/1d 2760989 ns 2758561 ns 1.00
array/sorting/by 3368803.5 ns 3368461 ns 1.00
array/sorting/2d 1088682 ns 1089562 ns 1.00
cuda/synchronization/stream/auto 1027.6 ns 1066.6 ns 0.96
cuda/synchronization/stream/nonblocking 7564.6 ns 7691.3 ns 0.98
cuda/synchronization/stream/blocking 815.8111111111111 ns 844.0121951219512 ns 0.97
cuda/synchronization/context/auto 1153.8 ns 1211.4 ns 0.95
cuda/synchronization/context/nonblocking 8424.400000000001 ns 6881.1 ns 1.22
cuda/synchronization/context/blocking 894.8888888888889 ns 924.7692307692307 ns 0.97

This comment was automatically generated by a workflow using github-action-benchmark.

@kshyatt added the cuda kernels label (Stuff about writing CUDA kernels.) on Jul 26, 2025
@maleadt (Member) commented Jul 29, 2025

Well, that's a bit all over the place.

[only benchmarks]
@christiangnrd (Member, Author)

> Well, that's a bit all over the place.

Indeed.

By hacking the big mapreduce kernel's heuristic for the by-thread vs. by-block decision into AK, we recover most of the performance discrepancy. It's still a regression, but on a 3090 the Int64/dims=1 ratio is 7.3 for AK/master and 1.7 for (AK with the better heuristic)/master. For Float32/dims=1 the ratios are 6.1 and 1.2, respectively. I still need to figure out what's happening with dims=2L, but at least the performance discrepancy won't be as bad once JuliaGPU/KernelAbstractions.jl#631 is figured out and implemented in AK.
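
For reference, here is a minimal sketch of the kind of shape-based switch being discussed. It is plain Julia with a made-up function name and a made-up saturation threshold, not the actual CUDA.jl or AcceleratedKernels code; the idea is to use the serial per-thread scan only when one thread per slice already saturates the device.

```julia
# Hypothetical illustration only: pick between scanning each slice serially with a
# single thread ("by-thread") or cooperatively with a whole block ("by-block"),
# based purely on the array shape. The saturation threshold is a placeholder.
function accumulate_strategy(sz::Dims, dims::Integer;
                             device_threads = 82 * 1_536)  # e.g. RTX 3090: 82 SMs × 1536 threads/SM
    len = sz[dims]               # length of the scanned dimension
    nslices = prod(sz) ÷ len     # number of independent slices
    # If one thread per slice already saturates the device, the serial per-thread
    # scan wins because it needs no intra-slice synchronization; otherwise a whole
    # block should cooperate on each slice.
    return nslices >= device_threads ? :by_thread : :by_block
end

accumulate_strategy((1024, 1024), 1)     # => :by_block  (only 1024 slices)
accumulate_strategy((16, 1_000_000), 1)  # => :by_thread (10^6 short slices)
```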

@maleadt (Member) commented Jul 30, 2025

> 1.7 for (AK with better heuristic)/master

That's too bad, and very much at odds with the results I've seen presented on AK.jl at e.g. JuliaCon. I guess the reduction kernel wasn't really optimized properly yet (the paper seems to focus on sorting operations).

@christiangnrd (Member, Author)

> I guess the reduction kernel wasn't really optimized properly yet

I suspect the dims=2L case could be mitigated by a better heuristic for choosing the block size, rather than always using 256. That test array has quite a weird shape, and there's a huge improvement with larger block sizes.
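
To illustrate the direction (not the actual AK code), a hedged sketch of deriving the block size from CUDA.jl's occupancy API instead of hard-coding 256; `dummy_kernel!` here is a trivial stand-in for the real accumulate kernel:

```julia
using CUDA

# Trivial stand-in kernel; the real code would be the AK accumulate kernel.
function dummy_kernel!(out, inp)
    i = (blockIdx().x - 1) * blockDim().x + threadIdx().x
    if i <= length(inp)
        @inbounds out[i] = inp[i]
    end
    return
end

# Launch with an occupancy-derived block size instead of a hard-coded 256.
function launch_with_occupancy!(out, inp)
    kernel = @cuda launch=false dummy_kernel!(out, inp)
    config = launch_configuration(kernel.fun)    # suggested threads per block for this kernel
    threads = min(length(inp), config.threads)
    blocks = cld(length(inp), threads)
    kernel(out, inp; threads, blocks)
end

# Usage (requires a CUDA device):
# a = CUDA.rand(Float32, 2^20); b = similar(a); launch_with_occupancy!(b, a)
```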

@kshyatt (Member) commented Aug 4, 2025

@christiangnrd do you think it's worthwhile to open an issue at AK.jl about this (if one isn't open already)?

@christiangnrd (Member, Author)

> @christiangnrd do you think it's worthwhile to open an issue at AK.jl about this (if one isn't open already)?

Good idea. I opened #60.

@anicusan (Member)

Do I read this correctly?

For accumulate:

  • 1d is faster (0.51 / 0.64).
  • Nd is faster for one dim but slower for the other; this might be something to improve in how we switch between the by-thread and by-block algorithms.

In the other PR (#2815) for mapreduce:

  • 1d is comparable.
  • Nd is much slower for the L cases; otherwise it is faster for one dim but slower for the other.

This is the same trend as the timings I posted when first implementing N-dimensional reductions (JuliaGPU/AcceleratedKernels.jl#6 (comment)); AK-0.1 didn't have dims :)

@christiangnrd is right, we should definitely improve the heuristic for switching between the by-thread and by-block algorithms. For the innermost reduction kernel, though, the CUDA.jl algorithm should be superior, and until we have warp sizes and shuffle instructions exposed in KernelAbstractions I don't think we can do much better (implementation and notes here). What is better (and original, afaik) in the AK mapreduce is that it does not do recursive memory allocations when multiple kernel launches are needed (it switches views into different ends of the same vector here), so memory consumption is bounded and known upfront.
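
To make the bounded-memory point concrete, here is a small CPU-only sketch of that ping-pong pattern; `tree_reduce`, the serial inner loop, and the scratch sizing are illustrative stand-ins, not the AK implementation:

```julia
# Illustrative only: a tree reduction whose scratch memory is allocated once
# and reused by ping-ponging between the two halves of a single buffer.
function tree_reduce(op, xs::Vector{T}; block = 4) where {T}
    @assert !isempty(xs)
    nblocks = cld(length(xs), block)
    scratch = Vector{T}(undef, 2 * nblocks)   # total scratch known upfront
    src, n, pass = xs, length(xs), 0
    while n > 1
        m = cld(n, block)
        # Alternate halves so we never write into the half currently being read.
        half = isodd(pass) ? view(scratch, (nblocks + 1):(2 * nblocks)) :
                             view(scratch, 1:nblocks)
        dst = view(half, 1:m)
        for b in 1:m                          # stand-in for one kernel launch
            lo, hi = (b - 1) * block + 1, min(b * block, n)
            dst[b] = reduce(op, view(src, lo:hi))
        end
        src, n, pass = dst, m, pass + 1
    end
    return src[1]
end

tree_reduce(+, collect(1:1000)) == sum(1:1000)   # true
```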

I'll use the L cases for Nd mapreduce to investigate bottlenecks...

Labels: cuda kernels (Stuff about writing CUDA kernels.)
Projects: None yet
4 participants